AITopics | test plan

test plan

VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation

Liu, Ethan TS., Wang, Austin, Mateega, Spencer, Georgescu, Carlos, Tang, Danny

arXiv.org Artificial Intelligence, May-27-2025

Ensuring that large language models (LLMs) can effectively assess, detect, explain, and remediate software vulnerabilities is critical for building robust and secure software systems. We introduce VADER, a human-evaluated benchmark designed explicitly to assess LLM performance across four key vulnerability-handling dimensions: assessment, detection, explanation, and remediation. VADER comprises 174 real-world software vulnerabilities, each carefully curated from GitHub repositories and annotated by security experts. For each vulnerability case, models are tasked with identifying the flaw, classifying it using Common Weakness Enumeration (CWE), explaining its underlying cause, proposing a patch, and formulating a test plan. Using a one-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER. Human security experts evaluated each response according to a rigorous, standardized scoring rubric that emphasizes remediation (quality of the code fix, 50%), explanation (20%), and classification and test plan (30%). Our results show that current state-of-the-art LLMs achieve only moderate success on VADER: OpenAI's o3 attained 54.7% accuracy overall, with the others in the 49-54% range, indicating ample room for improvement. Notably, remediation quality is strongly correlated (Pearson r > 0.97) with accurate classification and test plans, suggesting that models that effectively categorize vulnerabilities also tend to fix them well. VADER's comprehensive dataset, detailed evaluation rubrics, scoring tools, and visualized results with confidence intervals are publicly released, providing the community with an interpretable, reproducible benchmark to advance vulnerability-aware LLMs. All code and data are available at: https://github.com/AfterQuery/vader
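To make the rubric weighting in the abstract concrete, here is a minimal Python sketch of how three component scores could be combined into one overall score. Only the weights (50%, 20%, 30%) come from the abstract. The function name `weighted_score`, the component keys, and the assumption that each component is normalized to the range 0 to 1 are illustrative and are not taken from the VADER scoring tools.

```python
# Weights as stated in the abstract: remediation 50%, explanation 20%,
# classification and test plan 30%.
WEIGHTS = {
    "remediation": 0.50,
    "explanation": 0.20,
    "classification_and_test_plan": 0.30,
}


def weighted_score(component_scores):
    """Combine per-component scores in [0, 1] into one weighted score in [0, 1].

    Assumption: each component has already been normalized to [0, 1];
    the real rubric may use a different scale.
    """
    if set(component_scores) != set(WEIGHTS):
        raise ValueError("expected components: " + ", ".join(sorted(WEIGHTS)))
    for name, value in component_scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


# Hypothetical response: a half-credit fix, a good explanation,
# and a half-credit classification and test plan.
example = {
    "remediation": 0.5,
    "explanation": 0.75,
    "classification_and_test_plan": 0.5,
}
print(f"overall score: {weighted_score(example):.2f}")
```

Because remediation carries half of the weight, a response with a weak patch cannot score highly under this scheme even if its explanation is excellent.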

large language model, machine learning, natural language, (20 more...)

arXiv:2505.19395

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Planning Reliability Assurance Tests for Autonomous Vehicles

Zheng, Simin, Lu, Lu, Hong, Yili, Liu, Jian

arXiv.org Artificial Intelligence, Nov-30-2023

Artificial intelligence (AI) technology has become increasingly prevalent and is transforming everyday life. One important application of AI technology is the development of autonomous vehicles (AVs). However, the reliability of an AV needs to be carefully demonstrated via an assurance test so that the product can be used with confidence in the field. To plan an assurance test, one needs to determine how many AVs must be tested, for how many miles, and the standard for passing the test. Existing research has made great efforts in developing reliability demonstration tests for product development and assessment in other application fields. However, statistical methods have not been utilized in AV test planning. This paper aims to fill this gap by developing statistical methods for planning AV reliability assurance tests based on recurrent events data. We explore the relationship between multiple criteria of interest in the context of planning AV reliability assurance tests. Specifically, we develop two test planning strategies based on homogeneous and non-homogeneous Poisson processes while balancing multiple objectives with the Pareto front approach. We also offer recommendations for practical use. The disengagement events data from the California Department of Motor Vehicles AV testing program are used to illustrate the proposed assurance test planning methods.
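As a rough illustration of the planning question raised in the abstract (how many miles, and what passing standard), the Python sketch below works through a textbook reliability demonstration test under a homogeneous Poisson process. It is not the paper's method, which also covers non-homogeneous processes and a Pareto front over multiple criteria. The function names, the target rate, and the risk level are assumptions chosen for illustration. The test passes if at most `allowed_events` events occur in the planned mileage, and the mileage is chosen so that a fleet whose true event rate equals the target would pass with probability at most `alpha`.

```python
import math


def poisson_cdf(c, mu):
    """Return P(N <= c) for N ~ Poisson(mu)."""
    term = math.exp(-mu)  # P(N = 0)
    total = term
    for k in range(1, c + 1):
        term *= mu / k  # P(N = k) from P(N = k - 1)
        total += term
    return total


def required_miles(rate_target, alpha, allowed_events=0):
    """Smallest total mileage m with P(N <= allowed_events) <= alpha
    when events follow a homogeneous Poisson process with rate
    rate_target per mile (illustrative parameter names)."""
    if rate_target <= 0 or not 0 < alpha < 1 or allowed_events < 0:
        raise ValueError("need rate_target > 0, 0 < alpha < 1, allowed_events >= 0")

    def excess(miles):
        return poisson_cdf(allowed_events, rate_target * miles) - alpha

    # The pass probability decreases with mileage, so bracket the root, then bisect.
    lo, hi = 0.0, 1.0 / rate_target
    while excess(hi) > 0:
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical target: at most 1 event per 10,000 miles, 5% risk of a false pass.
for c in (0, 1, 2):
    print(c, round(required_miles(1e-4, 0.05, allowed_events=c)))
```

With zero allowed events the answer has the closed form m = -ln(alpha) / rate_target. Allowing more events makes the test more forgiving of a single bad trip but increases the mileage required.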

criteria, intensity, test plan, (16 more...)

arXiv:2312.00186

Country:

North America > United States > California (0.34)
North America > United States > Florida > Hillsborough County > Tampa (0.14)
North America > United States > Arizona > Pima County > Tucson (0.14)
(3 more...)

Genre: Research Report (0.64)

Industry:

Transportation > Ground > Road (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.85)
Information Technology > Artificial Intelligence > Applied AI (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
(2 more...)

How to approach conversation design with Amazon Lex: Building and testing (Part 3)

#artificialintelligence, Jan-5-2022, 20:22:02 GMT

In parts one and two of our guide to conversation design with Amazon Lex, we discussed how to gather requirements for your conversational AI application and draft conversational flows. In this post, we help you bring all the pieces together. You'll learn how to draft an interaction model to deliver natural conversational experiences, and how to test and tune your application. In the second post of this series, you identified some use cases that you wanted to automate and wrote sample interactions between a user and your application. In this post, we use these use cases to build an Amazon Lex framework, called an interaction model. But first, let's review some important definitions.

interaction model, use case, utterance, (15 more...)

Industry:

Banking & Finance (0.49)
Retail > Online (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

Test and Evaluation Framework for Multi-Agent Systems of Autonomous Intelligent Agents

Lanus, Erin, Hernandez, Ivan, Dachowicz, Adam, Freeman, Laura, Grande, Melanie, Lang, Andrew, Panchal, Jitesh H., Patrick, Anthony, Welch, Scott

arXiv.org Artificial Intelligence, Jan-25-2021

Test and evaluation is a necessary process for ensuring that engineered systems perform as intended under a variety of conditions, both expected and unexpected. In this work, we consider the unique challenges of developing a unifying test and evaluation framework for complex ensembles of cyber-physical systems with embedded artificial intelligence. We propose a framework that incorporates test and evaluation not only throughout the development life cycle but also into operation, as the system learns and adapts in a noisy, changing, and contended environment. The framework accounts for the challenges of testing the integration of diverse systems at various hierarchical scales of composition while respecting that testing time and resources are limited. A generic use case is provided for illustrative purposes, and research directions that emerged from exploring the use case via the framework are suggested.

hierarchy, interaction, software, (16 more...)


2101.1043

Country:

North America > United States > Virginia > Fairfax County > Fairfax (0.04)
North America > United States > Virginia > Arlington County > Arlington (0.04)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Government > Military (0.68)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Testing Features of ML Models - DZone AI

#artificialintelligence, Aug-17-2018, 17:31:02 GMT

In this post, you will learn about different types of test cases that you could come up with for testing the features of Data Science/Machine Learning models. Testing features is one of the key sets of tests that need to be performed to ensure the high performance of Machine Learning models in a consistent and sustained manner. Features are the most important part of a Machine Learning model. Features are nothing but the predictor variables, which are used to predict the outcome or response variable. Simply speaking, the following function represents y as the outcome variable and x1, x2, and x1x2 as predictor variables.
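The excerpt above does not reproduce the function it refers to. As a hedged illustration only, the Python sketch below assumes a linear model with an interaction term, y = b0 + b1*x1 + b2*x2 + b3*x1*x2, and shows what a basic feature test might look like. The names `build_features`, `predict`, and `check_features`, and the coefficient values, are hypothetical and do not come from the original post.

```python
import math


def build_features(x1, x2):
    """Return the predictors named in the excerpt: x1, x2, and the interaction x1x2."""
    return {"x1": x1, "x2": x2, "x1x2": x1 * x2}


def predict(features, coefficients, intercept=0.0):
    """Assumed linear form: intercept plus the sum of coefficient * feature value."""
    return intercept + sum(coefficients[name] * value for name, value in features.items())


def check_features(features, tolerance=1e-9):
    """Return a list of problems found in a feature row (empty if none)."""
    problems = []
    for name, value in features.items():
        if value is None or math.isnan(value):
            problems.append(f"{name} is missing")
    # Only check the interaction term once all values are known to be present.
    if not problems:
        expected = features["x1"] * features["x2"]
        if abs(features["x1x2"] - expected) > tolerance:
            problems.append("x1x2 does not equal x1 * x2")
    return problems


row = build_features(2.0, 3.0)
print(check_features(row))  # no problems expected for a freshly built row
print(predict(row, {"x1": 1.0, "x2": 1.0, "x1x2": 0.5}, intercept=1.0))
```

A check like `check_features` can run on every row before training or scoring, so that a missing value or a stale derived feature is caught before it degrades model performance.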

artificial intelligence, machine learning model, testing feature, (7 more...)

Genre: Instructional Material (0.60)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)
